Word Normalization in Twitter Using Finite-state Transducers
Abstract
This paper presents a linguistic approach based on weighted finite-state transducers for the lexical normalisation of Spanish Twitter messages. The system consists of transducers that are applied to out-of-vocabulary tokens. The transducers implement linguistic models of variation that generate sets of candidates according to a lexicon. A statistical language model is then used to obtain the most probable sequence of words. The article describes the components and evaluates the system and some of its parameters.
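The pipeline outlined in the abstract (generate candidates for out-of-vocabulary tokens, filter them against a lexicon, then let a language model choose the most probable word sequence) can be illustrated with a minimal Python sketch. This is not the authors' implementation and does not build real transducers: weighted regular-expression rewrites stand in for the variation models, and a tiny bigram table stands in for the statistical language model. The lexicon, rules, costs, and probabilities are all invented for illustration.

```python
import math
import re

# Hypothetical toy lexicon (not from the paper).
LEXICON = {"hola", "que", "quien", "tal"}

# Hypothetical bigram log-probabilities standing in for the language model.
BIGRAM_LOGP = {
    ("<s>", "hola"): -0.5,
    ("hola", "que"): -0.7,
    ("que", "tal"): -0.4,
}
UNSEEN_LOGP = -5.0  # assumed flat penalty for unseen bigrams

# Each rule is (pattern, replacement, cost), loosely mimicking weighted
# rewrite arcs. All rules and costs are assumptions for this sketch.
RULES = [
    (r"(.)\1+", r"\1", 0.5),  # collapse repeated letters: "holaaa" -> "hola"
    (r"^q$", "que", 1.0),     # abbreviation "q" -> "que"
    (r"^q$", "quien", 1.0),   # competing expansion "q" -> "quien"
]


def candidates(token):
    """Return {word: cost} for lexicon words reachable by applying rules."""
    if token in LEXICON:
        return {token: 0.0}  # in-vocabulary tokens are left alone
    found = {}
    frontier = {token: 0.0}
    for _ in range(len(RULES)):  # bound the number of chained rewrites
        next_frontier = {}
        for form, cost in frontier.items():
            for pattern, repl, rule_cost in RULES:
                new_form = re.sub(pattern, repl, form)
                if new_form == form:
                    continue
                new_cost = cost + rule_cost
                if new_cost < next_frontier.get(new_form, math.inf):
                    next_frontier[new_form] = new_cost
        for form, cost in next_frontier.items():
            if form in LEXICON and cost < found.get(form, math.inf):
                found[form] = cost
        frontier = next_frontier
    return found or {token: 0.0}  # no candidate found: keep the token as is


def normalize(tokens):
    """Viterbi search for the best-scoring sequence of candidates."""
    best = {"<s>": (0.0, [])}  # last word -> (score, path so far)
    for token in tokens:
        new_best = {}
        for word, cost in candidates(token).items():
            options = [
                (score + BIGRAM_LOGP.get((prev, word), UNSEEN_LOGP) - cost,
                 path + [word])
                for prev, (score, path) in best.items()
            ]
            new_best[word] = max(options, key=lambda o: o[0])
        best = new_best
    return max(best.values(), key=lambda o: o[0])[1]


print(normalize(["holaaa", "q", "tal"]))  # ['hola', 'que', 'tal']
```

In this sketch the token "q" has two equally cheap candidates, "que" and "quien", and the bigram scores decide between them. A real system would typically compose weighted transducers with a toolkit built for that purpose rather than chain regular expressions.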
Similar resources
Morphological Disambiguation and Text Normalization for Southern Quechua Varieties
We built a pipeline to normalize Quechua texts through morphological analysis and disambiguation. Word forms are analyzed by a set of cascaded finite state transducers which split the words and rewrite the morphemes to a normalized form. However, some of these morphemes, or rather morpheme combinations, are ambiguous, which may affect the normalization. For this reason, we disambiguate the morp...
Use of Weighted Finite State Transducers in Part of Speech
This paper addresses issues in part of speech disambiguation using finite-state transducers and presents two main contributions to the field. One of them is the use of finite-state machines for part of speech tagging. Linguistic and statistical information is represented in terms of weights on transitions in weighted finite-state transducers. Another contribution is the successful combination of techni...
Unsupervised Text Normalization Using Distributed Representations of Words and Phrases
Text normalization techniques that use rule-based normalization or string similarity based on static dictionaries are typically unable to capture domain-specific abbreviations (custy, cx → customer) and shorthands (5ever, 7ever → forever) used in informal texts. In this work, we exploit the property that noisy and canonical forms of a particular word share similar context in a large noisy text ...
Myanmar Number Normalization for Text-to-Speech
Text normalization is an essential module of a Text-to-Speech (TTS) system, as TTS systems need to work on real text. This paper describes Myanmar number normalization designed for a Myanmar Text-to-Speech system. Semiotic classes for the Myanmar language are identified by studying a Myanmar text corpus, and Weighted Finite State Transducer (WFST) based Myanmar number normalization is implemented. N...
Use of Weighted Finite State Transducers in Part of Speech Tagging
This paper addresses issues in part of speech disambiguation using finite-state transducers and presents two main contributions to the field. One of them is the use of finite-state machines for part of speech tagging. Linguistic and statistical information is represented in terms of weights on transitions in weighted finite-state transducers. Another contribution is the successful combination o...